
    A Hybrid Multi-Filter Wrapper Feature Selection Method for Software Defect Predictors

    Software Defect Prediction (SDP) is an approach used for identifying defect-prone software modules or components. It helps software engineers optimally allocate limited resources to defective software modules or components in the testing or maintenance phases of the software development life cycle (SDLC). Nonetheless, the predictive performance of SDP models depends largely on the quality of the dataset used to train them. The high dimensionality of software metric features has been noted as a data quality problem that negatively affects the predictive performance of SDP models. Feature Selection (FS) is a well-known method for addressing the high dimensionality problem and can be divided into filter-based and wrapper-based methods. Filter-based FS has low computational cost, but the predictive performance of the classification algorithm on the filtered data cannot be guaranteed. Wrapper-based FS, on the contrary, has good predictive performance but high computational cost and poor generalizability. Therefore, this study proposes a hybrid multi-filter wrapper method for selecting relevant and non-redundant features in software defect prediction. The proposed hybrid feature selection method will be developed to exploit filter-filter and filter-wrapper relationships to produce optimal feature subsets, reduce the evaluation cycle, and subsequently improve the overall predictive performance of SDP models in terms of Accuracy, Precision and Recall values.
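
    The filter-then-wrapper shape described above can be illustrated with a minimal sketch: several filter rankers score the features, their rankings are merged, and a small wrapper search refines the reduced subset. The choice of rankers (mutual information and the ANOVA F-score), the rank-merging rule, the wrapper (scikit-learn's SequentialFeatureSelector with logistic regression) and the synthetic dataset are illustrative assumptions, not the authors' exact configuration.

        # Minimal sketch of a multi-filter + wrapper feature selection pipeline.
        # The rankers, merge rule, and wrapper below are illustrative assumptions,
        # not the exact configuration proposed in the paper.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import (SequentialFeatureSelector,
                                                f_classif, mutual_info_classif)
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                                   random_state=0)

        # Step 1: multi-filter stage - rank features with two filters, merge the ranks.
        mi_rank = np.argsort(np.argsort(-mutual_info_classif(X, y, random_state=0)))
        f_rank = np.argsort(np.argsort(-f_classif(X, y)[0]))
        merged_rank = mi_rank + f_rank              # lower combined rank = more relevant
        top_k = np.argsort(merged_rank)[:15]        # keep the 15 best-ranked features

        # Step 2: wrapper stage - refine the reduced subset with a sequential search.
        wrapper = SequentialFeatureSelector(
            LogisticRegression(max_iter=1000), n_features_to_select=8, cv=3)
        wrapper.fit(X[:, top_k], y)
        selected = top_k[wrapper.get_support()]
        print("Selected feature indices:", selected)

    Running the wrapper only on the filter-reduced subset is what keeps the evaluation cycle short: the expensive cross-validated search sees 15 candidate features instead of 40.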

    A Novel Multidimensional Reference Model For Heterogeneous Textual Datasets Using Context, Semantic And Syntactic Clues

    With the advent of technology and the use of the latest devices, voluminous data are produced. Of these data, 80% are unstructured and the remaining 20% are structured and semi-structured. The produced data are in heterogeneous formats and do not follow any standards. Among heterogeneous (structured, semi-structured and unstructured) data, textual data are nowadays used by industries for the prediction and visualization of future challenges. Extracting useful information from them is challenging for stakeholders because of lexical and semantic matching issues. A few studies have addressed this issue using ontologies and semantic tools, but the main limitation of the proposed work was its limited coverage of multidimensional terms. To solve this problem, this study aims to produce a novel multidimensional reference model (MRM) using linguistic categories for heterogeneous textual datasets. Categories such as context, semantic and syntactic clues are considered, along with their scores. The main contribution of the MRM is that it checks each token against each term based on the indexing of linguistic categories such as synonym, antonym, formal, lexical word order and co-occurrence. The experiments show that the MRM performs better than the state-of-the-art single-dimension reference model in terms of coverage, linguistic categories and heterogeneous datasets.
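
    The token-against-term matching idea can be sketched as a weighted lookup over per-term category indexes. The example lexicon entries and the category weights below are hypothetical placeholders, not the MRM's actual index or scoring scheme.

        # Minimal sketch of scoring a token against indexed linguistic categories,
        # in the spirit of the multidimensional reference model (MRM) above.
        # The lexicon entries and category weights are hypothetical placeholders.
        reference_index = {
            "profit": {
                "synonym": {"gain", "earnings", "return"},
                "antonym": {"loss", "deficit"},
                "co-occurrence": {"revenue", "margin", "quarter"},
            },
        }
        category_weights = {"synonym": 1.0, "antonym": 0.5, "co-occurrence": 0.7}

        def score_token(token: str, term: str) -> float:
            """Sum the weights of every category of `term` that lists `token`."""
            categories = reference_index.get(term, {})
            return sum(weight for cat, weight in category_weights.items()
                       if token in categories.get(cat, set()))

        for tok in ["earnings", "loss", "quarter", "weather"]:
            print(tok, "->", score_token(tok, "profit"))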

    HABCSm: A Hamming Based t-way Strategy based on Hybrid Artificial Bee Colony for Variable Strength Test Sets Generation

    Search-based software engineering, which involves the deployment of meta-heuristics in applicable software processes, has been gaining wide attention. Recently, researchers have been advocating the adoption of meta-heuristic algorithms for t-way testing strategies (where t indicates the interaction strength among parameters). Although helpful, no single meta-heuristic based t-way strategy can claim dominance over its counterparts. For this reason, the hybridization of meta-heuristic algorithms can help harness the search capabilities of each by compensating for the limitations of one algorithm with the strengths of others. Consequently, a new meta-heuristic based t-way strategy, called the Hybrid Artificial Bee Colony (HABCSm) strategy and based on merging the advantages of the Artificial Bee Colony (ABC) algorithm with those of the Particle Swarm Optimization (PSO) algorithm, is proposed in this paper. HABCSm is the first t-way strategy to adopt the Hybrid Artificial Bee Colony (HABC) algorithm with Hamming distance as its core method for generating a final test set, and the first to adopt the Hamming distance as the final selection criterion for enhancing the exploration of new solutions. The experimental results demonstrate that HABCSm provides superior competitive performance over its counterparts. Therefore, this finding contributes to the field of software testing by minimizing the number of test cases required for test execution.
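
    The Hamming-distance selection criterion mentioned above can be illustrated in isolation: among candidate test cases, prefer the one farthest (in Hamming distance) from the tests already accepted, which pushes the search toward unexplored value combinations. The random candidate generation and the tiny parameter model below are simplified stand-ins for the hybrid ABC/PSO search and interaction-coverage bookkeeping of HABCSm.

        # Minimal sketch of Hamming-distance based candidate selection for a
        # t-way test set. Candidate generation here is random; in HABCSm the
        # candidates come from the hybrid ABC/PSO search.
        import random

        def hamming(a, b):
            return sum(x != y for x, y in zip(a, b))

        def min_distance_to_set(candidate, test_set):
            return min((hamming(candidate, t) for t in test_set), default=0)

        random.seed(0)
        parameters = [2, 3, 2, 2]                 # number of values per parameter
        test_set = [[0, 0, 0, 0]]                 # already-selected test cases

        candidates = [[random.randrange(v) for v in parameters] for _ in range(20)]

        # Select the candidate that maximizes the minimum distance to the set.
        best = max(candidates, key=lambda c: min_distance_to_set(c, test_set))
        test_set.append(best)
        print("Added test case:", best)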

    Performance Analysis of Feature Selection Methods in Software Defect Prediction: A Search Method Approach

    Software Defect Prediction (SDP) models are built using software metrics derived from software systems. The quality of SDP models depends largely on the quality of the software metrics (dataset) used to build them. High dimensionality is one of the data quality problems that affect the performance of SDP models. Feature selection (FS) is a proven method for addressing the dimensionality problem. However, the choice of FS method for SDP is still a problem, as most empirical studies on FS methods for SDP produce contradictory and inconsistent outcomes. These FS methods behave differently due to different underlying computational characteristics. This could be due to the choice of search method used in FS, because the impact of FS depends on the choice of search method. It is hence imperative to comparatively analyze the performance of FS methods based on different search methods in SDP. In this paper, four filter feature ranking (FFR) and fourteen filter feature subset selection (FSS) methods were evaluated using four different classifiers over five software defect datasets obtained from the National Aeronautics and Space Administration (NASA) repository. The experimental analysis showed that the application of FS improves the predictive performance of classifiers and that the performance of FS methods can vary across datasets and classifiers. Among the FFR methods, Information Gain demonstrated the greatest improvement in the performance of the prediction models. Among the FSS methods, Consistency Feature Subset Selection based on Best First Search had the best influence on the prediction models. However, prediction models based on FFR proved to be more stable than those based on FSS methods. Hence, we conclude that FS methods improve the performance of SDP models and that there is no single best FS method, as their performance varied according to the datasets and the choice of prediction model. However, we recommend the use of FFR methods, as the prediction models based on FFR are more stable in terms of predictive performance.
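
    The kind of comparison reported above (classifier with and without filter feature ranking) can be sketched with a small cross-validated experiment. Information Gain is approximated here by mutual information, a synthetic imbalanced dataset stands in for the NASA defect datasets, and Naive Bayes stands in for the studied classifiers; all of these are assumptions for illustration only.

        # Minimal sketch: compare a classifier on all features vs. on the top-k
        # features ranked by an Information Gain style filter (mutual information).
        # Synthetic data stands in for the NASA defect datasets.
        from sklearn.datasets import make_classification
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB
        from sklearn.pipeline import make_pipeline

        X, y = make_classification(n_samples=400, n_features=30, n_informative=6,
                                   weights=[0.8, 0.2], random_state=1)

        baseline = GaussianNB()
        with_ffr = make_pipeline(SelectKBest(mutual_info_classif, k=10), GaussianNB())

        print("All features :", cross_val_score(baseline, X, y, cv=5, scoring="roc_auc").mean())
        print("IG top-10    :", cross_val_score(with_ffr, X, y, cv=5, scoring="roc_auc").mean())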

    Heterogeneous Ensemble with Combined Dimensionality Reduction for Social Spam Detection

    This study presents a novel framework based on a heterogeneous ensemble method and a hybrid dimensionality reduction technique for spam detection in micro-blogging social networks. A hybrid of Information Gain (IG) and Principal Component Analysis (PCA) (dimensionality reduction) was implemented for the selection of important features, and a heterogeneous ensemble consisting of Naïve Bayes (NB), K Nearest Neighbor (KNN), Logistic Regression (LR) and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) classifiers based on Average of Probabilities (AOP) was used for spam detection. The proposed framework was applied to the MPI_SWS and SAC'13 Tip spam datasets, and the developed models were evaluated based on accuracy, precision, recall, F-measure, and area under the curve (AUC). From the experimental results, the proposed framework (that is, Ensemble + IG + PCA) outperformed the other experimented methods on the studied spam datasets. Specifically, the proposed method had an average accuracy value of 87.5%, an average precision score of 0.877, an average recall value of 0.845, an average F-measure value of 0.872 and an average AUC value of 0.943. Also, the proposed method performed better than some existing methods. Consequently, this study has shown that addressing high dimensionality in spam datasets, in this case with a hybrid of IG and PCA combined with a heterogeneous ensemble method, can produce a more effective method for detecting spam content.
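
    The overall shape of the framework (IG filter, then PCA, then an "average of probabilities" ensemble of heterogeneous classifiers) can be sketched with scikit-learn, where soft voting averages the predicted probabilities. Information Gain is approximated by mutual information, a decision tree stands in for RIPPER (which has no scikit-learn implementation), and the synthetic dataset and component counts are assumptions rather than the paper's configuration.

        # Minimal sketch of the Ensemble + IG + PCA framework shape: an Information
        # Gain style filter, PCA, and a soft-voting (average of probabilities)
        # ensemble of heterogeneous classifiers. A decision tree stands in for RIPPER.
        from sklearn.datasets import make_classification
        from sklearn.decomposition import PCA
        from sklearn.ensemble import VotingClassifier
        from sklearn.feature_selection import SelectKBest, mutual_info_classif
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score
        from sklearn.naive_bayes import GaussianNB
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.pipeline import make_pipeline
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=500, n_features=50, n_informative=10,
                                   random_state=2)

        ensemble = VotingClassifier(
            estimators=[("nb", GaussianNB()),
                        ("knn", KNeighborsClassifier()),
                        ("lr", LogisticRegression(max_iter=1000)),
                        ("tree", DecisionTreeClassifier(random_state=2))],
            voting="soft")  # soft voting = average of predicted probabilities (AOP)

        model = make_pipeline(SelectKBest(mutual_info_classif, k=25),
                              PCA(n_components=10), ensemble)
        print("AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())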

    Visual Signifier for Large Multi-Touch Display to Support Interaction in a Virtual Museum Interface

    The signifier is regarded as a crucial part of interface design, since it ensures that the user can manage the device appropriately and understand the interaction that is taking place. Useful signifiers keep users’ attention on learning, but poorly designed signifiers can disrupt learning by slowing progress and making the interface harder to use. The problem is that prior research identified the qualities of signifiers, but their attributes in terms of being visually apparent in broad interaction areas were not well recognized. Implementing a signifier without sufficient visual features, such as a picture, figure or gesture, may interfere with the user’s ability to navigate the surface, particularly when dealing with domains that demand “leisure exploration,” such as culture and heritage, and notably the museum application. As technology advances and improves, employing a multi-touch tabletop as a public viewing medium should be advantageous in conserving cultural heritage. Some visual elements should be incorporated into the signifier to produce a conspicuous presentation and make it easier for users to identify. In this study, a preliminary study, a card sorting survey, and a high-fidelity experiment were used to investigate users’ experience, perspective, and interpretation of the visual signifiers of a museum interface for large displays. This work offers a set of integrated visual signifiers on a large multi-touch display that makes a substantial contribution to supporting navigation and interaction on a large display, thereby aiding comprehension of the exhibited information visualization.

    Data Harmonization for Heterogeneous Datasets: A Systematic Literature Review

    As data size increases drastically, its variety also increases. Investigating such heterogeneous data is one of the most challenging tasks in information management and data analytics. The heterogeneity and decentralization of data sources affect data visualization and prediction, thereby influencing analytical results accordingly. Data harmonization (DH) corresponds to a field that unifies the representation of such disparate data. Over the years, multiple solutions have been developed to minimize the heterogeneity and disparity in formats of big-data types. In this study, a systematic review of the literature was conducted to assess the state-of-the-art DH techniques. This study aimed to understand the issues caused by heterogeneity, the need for DH and the techniques that deal with substantial heterogeneous textual datasets. The search process produced 1355 articles, but among them only 70 articles were found to be relevant through inclusion and exclusion criteria. The results show that the heterogeneity of structured, semi-structured, and unstructured (SSU) data can be managed by using DH and its core techniques, such as text preprocessing, Natural Language Processing (NLP), machine learning (ML), and deep learning (DL). These techniques are applied to many real-world applications centered on the information-retrieval domain. Several assessment criteria were used to measure the efficiency of these techniques, such as precision, recall, F1, accuracy, and time. A detailed explanation of each research question, common techniques, and performance measures is also discussed. Lastly, we present readers with a detailed discussion of the existing work, contributions, and managerial and academic implications, along with the conclusion, limitations, and future research directions.

    Malicious URLs Detection Using Data Streaming Algorithms

    As a result of the advancement in technology and technological devices, data is now spawned at an enormous rate, emanating from a vast array of networks and devices as well as daily operations such as credit card transactions and mobile phone usage. A data stream entails sequential and real-time continuous data in the form of an evolving stream. However, the traditional machine learning approach is characterized by a batch learning model in which labelled training data are given a priori to train a model based on some machine learning algorithm. This technique requires the entire set of training samples to be readily accessible before the learning process. In this setting, the training procedure is mostly done in an offline environment owing to the high cost of training. Consequently, the traditional batch learning technique suffers from serious drawbacks, such as poor scalability for real-time phishing website detection, because the model mostly requires re-training from scratch on new training samples. Thus, this paper presents the application of streaming algorithms for detecting malicious URLs based on selected online learners, which include Hoeffding Tree (HT), Naïve Bayes (NB), and Ozabag. Experimental results on two prominent phishing datasets showed that Ozabag produced promising results in terms of accuracy, Kappa and Kappa Temp on the dataset with large samples, while HT and NB had the lowest prediction time, with accuracy and Kappa comparable to the Ozabag algorithm, for the real-time detection of phishing websites.
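
    The streaming setting described above is usually evaluated with a prequential (test-then-train) loop: each arriving example is first scored by the current model and only then used to update it. In the sketch below, an incremental Gaussian Naive Bayes from scikit-learn stands in for the online learners; Hoeffding Tree and Ozabag would come from a dedicated streaming library such as scikit-multiflow or river, and the URL feature vectors and labels here are synthetic placeholders.

        # Minimal sketch of prequential (test-then-train) evaluation for a streaming
        # URL classifier. Incremental Gaussian Naive Bayes stands in for the online
        # learners; the URL features are synthetic placeholders.
        import numpy as np
        from sklearn.naive_bayes import GaussianNB

        rng = np.random.default_rng(0)
        classes = np.array([0, 1])            # 0 = benign URL, 1 = phishing URL
        model = GaussianNB()

        # Warm-up with one example per class so the first predictions are defined.
        X_warm = np.vstack([rng.normal(loc=0, size=8), rng.normal(loc=1, size=8)])
        model.partial_fit(X_warm, [0, 1], classes=classes)

        correct, seen = 0, 0
        for _ in range(1000):                 # each iteration is one arriving URL
            y_true = int(rng.integers(0, 2))
            x = rng.normal(loc=y_true, size=(1, 8))   # placeholder feature vector
            correct += int(model.predict(x)[0] == y_true)   # test first ...
            model.partial_fit(x, [y_true])                  # ... then train
            seen += 1

        print("Prequential accuracy:", correct / seen)

    The test-then-train ordering is what makes the accuracy estimate honest in a streaming setting: every example is scored before the model has ever seen it.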

    Software Requirement Risk Prediction Using Enhanced Fuzzy Induction Models

    The development of most modern software systems is accompanied by a significant level of uncertainty, which can be attributed to the unanticipated activities that may occur throughout the software development process. As these modern software systems become more complex and drawn out, escalating software project failure rates have become a critical concern. These unforeseeable uncertainties are known as software risks, and they emerge from many risk factors inherent to the numerous activities comprising the software development lifecycle (SDLC). Consequently, these software risks have resulted in massive revenue losses for software organizations. Hence, it is imperative to address these software risks to curb future software system failures. The subjective risk assessment (SRM) method is regarded as a viable solution to software risk problems. However, it is inherently reliant on humans and, therefore, in certain situations imprecise, due to its dependence on an expert’s knowledge and experience. In addition, the SRM does not allow repeatability, as expertise is not easily exchanged across the different units working on a software project. Developing intelligent modelling methods that may offer more unbiased, reproducible, and explainable decision-making assistance in risk management is therefore crucial. Hence, this research proposes enhanced fuzzy induction models for software requirement risk prediction. Specifically, the fuzzy unordered rule induction algorithm (FURIA) and its enhanced variants based on nested subset selection dichotomies are developed for software requirement risk prediction. The proposed fuzzy induction models are based on the use of effective rule-stretching methods for the prediction process. Additionally, the proposed FURIA method is enhanced through the introduction of nested subset selection dichotomy concepts into its prediction process. The prediction performances of the proposed models are evaluated using a benchmark dataset and are then compared with existing machine learning (ML)-based and rule-based software risk prediction models. From the experimental results, it was observed that FURIA performed comparably, in most cases, to the rule-based and ML-based models. However, the FURIA nested dichotomy variants were superior in performance to the conventional FURIA method and to the rule-based and ML-based methods, with accuracy, area under the curve (AUC), and Matthews correlation coefficient (MCC) values of approximately 98%.
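
    The nested-dichotomy enhancement decomposes a multi-class problem into a tree of binary problems, each handled by the base learner (FURIA in the paper). FURIA is a WEKA fuzzy rule learner with no scikit-learn counterpart, so a decision tree stands in below; the three risk levels and the particular class split are illustrative assumptions, and the sketch shows only the dichotomy decomposition, not FURIA's rule stretching.

        # Minimal sketch of the nested-dichotomy idea: a three-class risk-level
        # problem is split into a tree of binary problems, each solved by a base
        # learner (a decision tree stands in for FURIA here).
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.tree import DecisionTreeClassifier

        X, y = make_classification(n_samples=300, n_features=10, n_informative=6,
                                   n_classes=3, random_state=3)  # e.g. low/medium/high

        # Dichotomy 1: {low} vs {medium, high}; Dichotomy 2: {medium} vs {high}.
        d1 = DecisionTreeClassifier(random_state=3).fit(X, (y > 0).astype(int))
        mask = y > 0
        d2 = DecisionTreeClassifier(random_state=3).fit(X[mask], (y[mask] == 2).astype(int))

        def predict_risk(x):
            x = x.reshape(1, -1)
            if d1.predict(x)[0] == 0:
                return 0                                 # low risk
            return 2 if d2.predict(x)[0] == 1 else 1     # high vs medium risk

        preds = np.array([predict_risk(row) for row in X])
        print("Training accuracy of the nested dichotomy:", (preds == y).mean())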